There are some base R functions that make good tools for exploring how probability works and for visualizing the properties of our data sets. Now that we are familiar with how to construct the basic object types in R and how to subset them, we can use some other useful functions to gather descriptive statistics from a data set.
In Weeks 1 and 2, we learned about R objects and a number of useful R functions for creating and manipulating them.
The basic purpose of statistics is the collection, organization, and interpretation of data. These are interrelated concepts, so we will build our knowledge gradually in each domain, rather than attempt to master data collection (i.e. sampling and populations) before we advance to data organization.
Today, we're working with a dataset that I generated, with counts of several plant taxa recorded across a range of depths. This simulates a stratigraphic study. We need to state some assumptions at the outset.
There is a ton of user-generated code out there that we can use on our own data. One way this code is distributed is through packages, which are hosted by CRAN. Authors of the packages are trusted to maintain them and this does not always happen, so be careful how many dependencies you write into your code. For today’s lab, we need to use one of the functions from the “vioplot” package. Use the code below to make sure it is installed and, if not, download the package.
packages <- c("vioplot") #Make an object listing packages
install.packages(setdiff(packages, row.names(installed.packages()))) #Install any packages that are in the object but not found on your machine.
#We can also do this manually by loading the vioplot library.
library(vioplot)
## Loading required package: sm
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#vioplot::vioplot() #This lets us call a function without loading the library and is how we will use it later.
We have already introduced a range of data types as we’ve become more familiar with R objects. It is helpful to have a working definition of major types of data in mind as we move forward.
Our primary interest is in measuring variation, and the categories within which we expect more than two states are defined as variables. A large proportion of paleoecology research is conducted by counting the number of different types of fossil organisms, making each taxon a different variable. These are fossil variables (this is my term!).
The other set of variables we are interested in relate to the sites/strata from which we collect this data. These data may be: synchronic, describing present conditions of the site; or diachronic, describing changes through time in parallel to the fossil variables. These are site variables (again, my term). We have already described some of the variables we might derive from sites/strata.
There are lots of kinds of data that we might collect, which we arrange in different ways. Working through an example will give us a clearer sense of how data organization impacts subsequent analyses and interpretation.
We will read the fake pollen data set from last week and use some functions to gather insights. We are taking for granted how many variables we are looking at here and how many samples we have collected. We will come back to populations and sampling a little later. For now, let’s concern ourselves with taking some basic measurements from our data set.
fake_core <- read.csv("data/Fake_core.csv",
header = TRUE,
row.names = "depth")
ncol(fake_core) #How many variables?
## [1] 9
nrow(fake_core) #How many samples?
## [1] 100
sum(fake_core) #This is the sum of all the values in the table.
## [1] 21481
Descriptive statistics can offer some initial insights into numeric variables. Let's look at a handful of basic functions for deriving these statistics for one of the classes of fossils, "Cyperaceae undiff.". These can be computed with base R functions, but we will also derive them manually to be sure we understand what they measure.
sum(fake_core$Cyperaceae.undiff.)
## [1] 2526
mean(fake_core$Cyperaceae.undiff.)
## [1] 25.26
sum(fake_core$Cyperaceae.undiff.)/length(fake_core$Cyperaceae.undiff.) #Mean is sum of observations divided by number of observations.
## [1] 25.26
The median value is the middle value of the ranked values for a given variable: 50% of the observations fall above it and 50% below. Deriving it also teaches us a couple of useful base R functions (unique() and order()).
median(fake_core$Cyperaceae.undiff.)
## [1] 25
vls_cyp <- unique(fake_core$Cyperaceae.undiff.) #Unique lets us get all of the unique values from our variable.
vls_cyp
## [1] 10 11 6 19 9 15 12 14 21 17 7 18 16 22 20 25 32 23 36 30 29 28 34 31 26
## [26] 35 24 33 38 39 37 27 41 46 42 40 45
ord_cyp <- vls_cyp[order(vls_cyp)] #Here, we are taking those values and organizing them by rank-order. order() returns the indices that would sort the vector, so we use them to subset it.
ord_cyp
## [1] 6 7 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [26] 33 34 35 36 37 38 39 40 41 42 45 46
mdn_cyp <- ord_cyp[floor(length(ord_cyp)/2)] #Here, we grab the middle value of the range of numbers. There are differences for even and odd numbers of values.
mdn_cyp
## [1] 25
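One caution: unique() drops repeated observations, so the shortcut above will not always match median() on real data. A more faithful manual median (a small sketch; manual_median is a hypothetical helper name) sorts the full vector and handles both even and odd sample sizes:

```r
#Manual median over ALL observations (not just the unique values).
manual_median <- function(x) {
  srt <- sort(x)          #Rank-order the full vector, repeats included.
  n <- length(srt)
  if (n %% 2 == 1) {
    srt[(n + 1) / 2]      #Odd n: the single middle value.
  } else {
    mean(srt[c(n / 2, n / 2 + 1)]) #Even n: average the two middle values.
  }
}

manual_median(c(3, 1, 2))    #Odd length: returns 2.
manual_median(c(4, 1, 3, 2)) #Even length: mean of 2 and 3, returns 2.5.
```

For an even number of observations, the convention used by median() is to average the two middle values, which the sort-and-index approach reproduces here.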
We can use the apply() function to take another function (such as mean()) and apply it iteratively to a table of data. This makes our lives a great deal easier when working with tables of taxa. apply() looks for three arguments: a matrix-like object, a margin code designating rows (1) or columns (2), and a function. You can substitute custom functions as well. Below, we apply sum, mean, and median to the entire fake core data set.
fake_sums <- apply(fake_core, 2, sum)
fake_mean <- apply(fake_core, 2, mean)
fake_median <- apply(fake_core, 2, median)
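As a sketch of substituting a custom function (the coefficient of variation here is my own illustrative example, not part of the lab data), we can pass an anonymous function directly to apply():

```r
#Hypothetical example: coefficient of variation (sd/mean) for each column.
cv_per_col <- function(tbl) {
  apply(tbl, 2, function(x) sd(x) / mean(x))
}

toy <- data.frame(a = c(2, 4, 6), b = c(10, 10, 10))
cv_per_col(toy) #Column a has CV 0.5; b has no spread, so its CV is 0.
```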
Some other functions also generate informative statistics for us. We can gather quartile estimates (the values below which 25%, 50%, and 75% of the observations fall) using summary(). Summary can also be used in the apply() format and yields a nice table if we make it an object.
summary(fake_core$Cyperaceae.undiff.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 16.00 25.00 25.26 34.00 46.00
fake_smmy <- apply(fake_core, 2, summary)
fake_smmy
## Cyperaceae.undiff. Typha Poaceae Alchornea Macaranga Celtis Blighia
## Min. 6.00 2.00 2.00 3.00 1.00 12.00 3.00
## 1st Qu. 16.00 4.75 27.00 32.00 16.00 21.00 8.75
## Median 25.00 6.00 35.00 48.00 31.00 25.50 11.00
## Mean 25.26 6.64 34.84 49.61 31.39 25.29 10.99
## 3rd Qu. 34.00 9.00 42.25 67.25 43.00 30.00 13.00
## Max. 46.00 12.00 69.00 115.00 82.00 38.00 19.00
## Sapotaceae.undiff. Guibourtia.demeusei
## Min. 3.00 2.00
## 1st Qu. 10.00 11.00
## Median 16.00 15.50
## Mean 15.44 15.35
## 3rd Qu. 20.00 19.25
## Max. 29.00 28.00
Using barplot(), we can visualize this information relatively quickly.
par(mai = c(1, 2, 0.5, 0.5))
barplot(fake_smmy,
beside = TRUE,
horiz = TRUE,
las = 2)
par(mar = c(5, 4, 4, 2) + 0.1)
These basic statistics show us a few things about our data. We can also plot the distribution of Cyperaceae undiff. counts across the 100 samples and show where the median and the mean fall along this scatter plot.
plot(fake_core$Cyperaceae.undiff.,
100:1,
xlim = c(0, 50),
pch = 21)
lines(c(mean(fake_core$Cyperaceae.undiff.),
mean(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 3)
lines(c(median(fake_core$Cyperaceae.undiff.),
median(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 2)
These counts are integers, so we could expect repetition of some numbers. Let’s see what the distribution of individual counts looks like using table(), which will produce a table of counts from a vector of data.
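Before plotting, a quick standalone look at table() with toy values (not from the core data) shows what it returns:

```r
x <- c(2, 2, 3, 5, 5, 5)
counts <- table(x)
counts            #Value 2 appears twice, 3 once, 5 three times.
as.vector(counts) #The bare frequencies: 2 1 3.
```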
plot(fake_core$Cyperaceae.undiff.,
100:1,
xlim = c(0, 50),
pch = 19)
lines(table(fake_core$Cyperaceae.undiff.)*10, #Multiplying the counts by 10 so the frequency curve is visible at this scale.
lty = 3,
col = "darkred",
lwd = 1) #The line width argument (lwd) lets us set the width of the lines we're plotting.
lines(c(mean(fake_core$Cyperaceae.undiff.),
mean(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 1,
lwd = 3)
lines(c(median(fake_core$Cyperaceae.undiff.),
median(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 1,
col = "blue",
lwd = 3)
We can use this basic method to look at other species in our table. Let’s look at Alchornea.
#We could always copy-paste the code we just used, but if we catch ourselves copy-pasting, that means that we may be better off with a function.
plot(fake_core$Alchornea,
100:1,
xlim = c(0, max(fake_core$Alchornea)),
pch = 19)
lines(table(fake_core$Alchornea)*10,
lty = 3,
col = "darkred",
lwd = 1) #Again using lwd to set the line width.
lines(c(mean(fake_core$Alchornea),
mean(fake_core$Alchornea)),
c(0, 100),
lty = 1,
lwd = 3)
lines(c(median(fake_core$Alchornea),
median(fake_core$Alchornea)),
c(0, 100),
lty = 1,
col = "blue",
lwd = 3)
When looking for ways to build functions, look for repetitions. We repeat the variable (Alchornea) and the functions applied to it. So what if we wrote a function that took any table of counts and ran all of these commands on each column?
cust_plot = function(x){
par(mfrow = c(3, 3))
title_names = colnames(x)
for(i in 1:ncol(x)){
plot(x[,i],
1:length(x[,i]),
xlim = c(0, max(x[,i])),
pch = 19,
xlab = "NISP",
ylab = "sample")
lines(rep(mean(x[,i]),2),
c(0, length(x[,i])),
lwd = 3)
lines(rep(median(x[,i]),2),
c(0, length(x[,i])),
lwd = 3,
col = "blue")
lines((table(x[,i])/sum(table(x[,i])))*100,
lty = 3,
col = "darkred",
lwd = 1)
title(main = title_names[i])
axis(4, at = seq(10,100,10), labels = seq(0.1,1,0.1))
}
par(mfrow = c(1, 1))
}
Now that we have a function that we can use, all we have to do is switch up the inputs. With this limited number of taxa, we can actually plot all of them simultaneously.
cust_plot(fake_core)
Let's find out what the overall picture of fossil identifications from this site looks like. In our imaginary dataset, fossil identifications are distributed evenly across a 2-meter sedimentary sequence. We want to know a bit more about our total observations, so let's calculate a sum from the table.
One of the first questions we might ask is about the frequency distribution of a given variable. We can get a sense of this quickly with the histogram function (“hist()”).
par(mfrow = c(2, 1))
plot(fake_core$Cyperaceae.undiff.,
100:1,
xlim = c(0, 50),
pch = 19)
lines(c(mean(fake_core$Cyperaceae.undiff.),
mean(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 3)
lines(c(median(fake_core$Cyperaceae.undiff.),
median(fake_core$Cyperaceae.undiff.)),
c(0, 100),
lty = 2)
#barplot(table(fake_core$Cyperaceae.undiff.))
hist(fake_core$Cyperaceae.undiff.,
breaks = 10,
xlim = c(0,50))
par(mfrow = c(1, 1))
Now that we have an idea of how we can view our counts as frequency distributions, we can explore some more ways to describe these results.
Base R has functions to derive all of these, but let’s also derive them manually so we understand what these statistics are telling us.
sd(fake_core$Cyperaceae.undiff.) #Calling a single column
## [1] 10.25495
fake_sds = apply(fake_core, 2, sd) #Using apply to get sds of every column
fake_sds
## Cyperaceae.undiff. Typha Poaceae Alchornea
## 10.254952 2.630512 12.074449 24.551284
## Macaranga Celtis Blighia Sapotaceae.undiff.
## 17.936980 5.560130 3.412677 6.049159
## Guibourtia.demeusei
## 6.192819
apply(fake_core, 2, var)
## Cyperaceae.undiff. Typha Poaceae Alchornea
## 105.164040 6.919596 145.792323 602.765556
## Macaranga Celtis Blighia Sapotaceae.undiff.
## 321.735253 30.915051 11.646364 36.592323
## Guibourtia.demeusei
## 38.351010
The standard deviation helps us characterize variation in either a sample or a population. It is derived from some related values that are worth knowing, so let's start with methods for describing deviation.
mad(fake_core$Cyperaceae.undiff.)
## [1] 13.3434
mad_cyp = sum( abs(fake_core$Cyperaceae.undiff. - median(fake_core$Cyperaceae.undiff.) ) ) / length(fake_core$Cyperaceae.undiff.)
mad_cyp
## [1] 8.86
Above, we get different results between R and the manual calculation because the mad() function uses a constant to adjust for "asymptotically normal consistency" (see help(mad)). Notice first that we can use either the mean or the median as the basis for absolute deviation. This can be set with the "center =" and "constant =" arguments in mad(). In our manual calculation, we take the mean value of the absolute differences between the individual observations and their mean; this looks different from the formula above because I've subsumed the "sum()" and "/length(x)" steps by wrapping the entire calculation in "mean()". If we calculate the value manually using the median (second half of the code below), we get the same value as the "mad()" function with a constant of 1.
mad(fake_core$Cyperaceae.undiff., center = mean(fake_core$Cyperaceae.undiff.), constant = 1)
## [1] 8.74
mad_cyp = mean(abs(fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.)))
mad_cyp
## [1] 8.8652
#Now we calculate the median absolute deviation.
mad_cyp = median(abs(fake_core$Cyperaceae.undiff - median(fake_core$Cyperaceae.undiff.)))
mad_cyp
## [1] 9
mad(fake_core$Cyperaceae.undiff., constant = 1)
## [1] 9
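To connect the pieces: the default constant in mad() is 1.4826 (approximately 1/qnorm(0.75)), chosen so that the MAD estimates the standard deviation for normally distributed data. A quick sketch with arbitrary toy values:

```r
x <- c(6, 7, 9, 12, 25, 31, 40) #Arbitrary toy values.

#The default mad() is just the constant = 1 version scaled by 1.4826.
mad(x)
1.4826 * mad(x, constant = 1) #Same value.

#The constant itself comes from the 75th percentile of the standard normal.
1 / qnorm(0.75) #Approximately 1.4826.
```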
Here we are working with deviation from the mean or median. This is a logical enough way to express deviation in a set of numbers, but its applications and interpretation are limited (e.g., how easy is it to compare between variables?). However, it does lay the foundation for variance, which is the sum of the squared deviations from the mean divided by the number of observations. For samples, we divide instead by the number of observations less one (N - 1). Base R has a function for variance, "var()", but we can also calculate it manually.
var(fake_core$Cyperaceae.undiff.)
## [1] 105.164
var_cyp = mean((fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.))^2)
#Perhaps R calculates variance for a sample and not a population...
var_cyp = sum((fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.))^2)/
(length(fake_core$Cyperaceae.undiff.)-1) #Note that we're taking N - 1 here.
var(fake_core$Cyperaceae.undiff.)
## [1] 105.164
var_cyp
## [1] 105.164
In this case, we’ve finally gotten identical values when compared with the base R function. But what does variance mean and how do we interpret it? R lets us generate a lot of fake and specifically structured data. Let’s look at some general features of variance by looking at systematically modified input.
my_df = sapply(1:10, function(x){ #Using function to create progressively larger runs of numbers that are equally spaced.
seq(x, x*100, x)
})
apply(my_df, 2, var)
## [1] 841.6667 3366.6667 7575.0000 13466.6667 21041.6667 30300.0000
## [7] 41241.6667 53866.6667 68175.0000 84166.6667
plot(apply(my_df, 2, var), type = "o", pch = 21, bg = "orange")
Naturally, as the range of values gets larger, the variance goes up. But while the spacing of the sequences grows linearly, the variance grows with its square, so the curve bends upward. Let's look at a longer range of numbers and see how variance and the overall spread of the data interact.
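This curvature reflects a basic property of variance: multiplying every value by a constant c multiplies the variance by c^2. Since column x of my_df is just x * (1:100), we can verify this directly (a sketch that rebuilds my_df so it stands alone):

```r
my_df <- sapply(1:10, function(x){ #Same construction as above.
  seq(x, x*100, x)
})

#Column x is x * (1:100), so its variance should be x^2 * var(1:100).
base_var <- var(1:100)
all.equal(apply(my_df, 2, var), (1:10)^2 * base_var) #TRUE
```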
my_df = sapply(1:100, function(x){ #Using function to create progressively larger runs of numbers that are equally spaced.
seq(x, x*100, x)
})
apply(my_df, 2, var)
## [1] 841.6667 3366.6667 7575.0000 13466.6667 21041.6667
## [6] 30300.0000 41241.6667 53866.6667 68175.0000 84166.6667
## [11] 101841.6667 121200.0000 142241.6667 164966.6667 189375.0000
## [16] 215466.6667 243241.6667 272700.0000 303841.6667 336666.6667
## [21] 371175.0000 407366.6667 445241.6667 484800.0000 526041.6667
## [26] 568966.6667 613575.0000 659866.6667 707841.6667 757500.0000
## [31] 808841.6667 861866.6667 916575.0000 972966.6667 1031041.6667
## [36] 1090800.0000 1152241.6667 1215366.6667 1280175.0000 1346666.6667
## [41] 1414841.6667 1484700.0000 1556241.6667 1629466.6667 1704375.0000
## [46] 1780966.6667 1859241.6667 1939200.0000 2020841.6667 2104166.6667
## [51] 2189175.0000 2275866.6667 2364241.6667 2454300.0000 2546041.6667
## [56] 2639466.6667 2734575.0000 2831366.6667 2929841.6667 3030000.0000
## [61] 3131841.6667 3235366.6667 3340575.0000 3447466.6667 3556041.6667
## [66] 3666300.0000 3778241.6667 3891866.6667 4007175.0000 4124166.6667
## [71] 4242841.6667 4363200.0000 4485241.6667 4608966.6667 4734375.0000
## [76] 4861466.6667 4990241.6667 5120700.0000 5252841.6667 5386666.6667
## [81] 5522175.0000 5659366.6667 5798241.6667 5938800.0000 6081041.6667
## [86] 6224966.6667 6370575.0000 6517866.6667 6666841.6667 6817500.0000
## [91] 6969841.6667 7123866.6667 7279575.0000 7436966.6667 7596041.6667
## [96] 7756800.0000 7919241.6667 8083366.6667 8249175.0000 8416666.6667
plot(apply(my_df, 2, var),
type = "o",
pch = 21,
bg = "orange")
lines(apply(my_df, 2, mad)*1000) #Here we're adding the MAD values for the same dataset, multiplied by 1000 to make them visible.
Here, we see that the relationship between variance and the overall dispersion becomes roughly linear after a total dispersion of about 60,000, which gives us a variance of about 3 million. So populations that are structurally identical but vary in their total quantity are going to have different variances: the larger the total counts, the more variance we expect to find. We can also see that the (scaled) median absolute deviation overestimates variance up to about 40,000, then continually underestimates it afterwards. We will explore more about variance and population structure (i.e., the composite frequency distribution of variables in a population) once we have mastered some more univariate statistics.
This brings us to the standard deviation, which is derived directly from variance by taking its square root. Recall that our estimation of variance uses the squared differences from the mean to avoid negative values; taking the square root returns the measure of spread to the original units of the data.
We can also verify this with some population thinking using the function “rnorm()” which lets us create a normally-distributed sample using arguments for the mean and standard deviation.
my_df = sapply(1:1000, function(x){
rnorm(100, mean = x*10, sd = 10)
})
apply(my_df, 2, var)
## [1] 86.43602 81.04623 93.41824 95.68065 95.66001 99.99769 113.10026
## [8] 81.28159 128.71530 112.53739 104.25615 98.16949 100.96653 110.14285
## ... (output truncated: 1000 sample variances in all, each scattered around the specified sd^2 = 100)
plot(apply(my_df, 2, var), type = "o", pch = 21, bg = "orange")
We can learn a bit more by creating longer lists of repeating numbers, which will show us some of the impacts of sample size. Below, we make a short list of numbers with a mean and median of 9, then we repeat that list of numbers 100 times, taking the variance and median absolute deviation. The code below gives us three plots: one showing variance/MAD across the 100 repetitions of the number string; one showing a histogram of this data, and another showing frequency distributions as displayed using a combination of “plot()” and “table()” methods.
test_range = c(5:13)
plot_var = sapply(1:100, function(x){
var(rep(test_range, x))
})
plot_mad = sapply(1:100, function(x){
mad(rep(test_range, x))
})
par(mfrow = c(1, 3))
plot(plot_var, ylim = c(0,10), pch = 21, bg = "orange")
lines(plot_mad)
hist(rep(test_range, 100))
plot(table(rep(test_range,100)))
par(mfrow = c(1, 1))
Regarding variance, we see a minor impact of the total sample size on our measurement of variance, but it quickly approaches an asymptote after being doubled three or four times. The distribution of this data is even and we see that increasing the number of observations (so long as they’re the same numbers) does not impact our estimation of the mean or median. Will this hold for a set of numbers where the frequency distribution isn’t even?
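Before moving on, note what the asymptote is: the population variance. var() divides by N - 1, and repeating the same values only shrinks that correction. A sketch of the limit for the 5-to-13 range used above:

```r
v <- 5:13
pop_var <- mean((v - mean(v))^2) #Population variance (divide by N).

var(v)           #Sample variance of one copy: 7.5.
var(rep(v, 100)) #With many repeats, this approaches pop_var.
pop_var          #60/9, about 6.67.
```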
norm_range = #Using <shift+enter>, we can already make a visualization of the counts...
c(5, 5,
6, 6, 6,
7, 7, 7, 7,
8, 8, 8, 8, 8,
9, 9, 9, 9, 9, 9,
10, 10, 10, 10, 10,
11, 11, 11, 11,
12, 12, 12,
13, 13)
plot_var = sapply(1:100, function(x){
var(rep(norm_range, x))
})
plot_mad = sapply(1:100, function(x){
mad(rep(norm_range, x))
})
par(mfrow = c(1, 2))
plot(plot_var, ylim = c(floor(min(plot_mad)) - 0.5, ceiling(max(plot_var)) +0.5), pch = 21, bg = "orange")
lines(plot_mad)
plot(table(rep(norm_range,100)))
par(mfrow = c(1, 1))
Let’s try this with a v-shaped distribution.
v_range = c(
5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6,
7, 7, 7, 7,
8, 8, 8,
9, 9,
10, 10, 10,
11, 11, 11, 11,
12, 12, 12, 12, 12,
13, 13, 13, 13, 13, 13)
plot_var = sapply(1:100, function(x){
var(rep(v_range, x))
})
plot_mad = sapply(1:100, function(x){
mad(rep(v_range, x))
})
par(mfrow = c(1, 2))
plot(plot_var, ylim = c(floor(min(plot_mad)) - 0.5, ceiling(max(plot_var)) +0.5), pch = 21, bg = "orange")
lines(plot_mad)
plot(table(rep(v_range,100)))
par(mfrow = c(1, 1))
The replication structure we’re using here will serve us in the future. Let’s codify the repetitions and plotting into a function to save us all this copy-pasting.
stat_lines = function(x,#Creating a function called "stat_lines" it expects some vector of data "x" and the following arguments
times = length(x), #Setting this to "length(x)" by default, so calling this won't always be necessary. But we can add our own numbers!
title = NA){ #Setting up a title object, defaulting to NA
plot_u = sapply(1:times, function(y){ #Here's that fast custom function that creates means for the repeated set of numbers.
mean(rep(x, y))
})
plot_med = sapply(1:times, function(y){ #Same, but for the median.
median(rep(x, y))
})
plot_var = sapply(1:times, function(y){ #Same, but for variance.
var(rep(x, y))
})
plot_mad = sapply(1:times, function(y){ #Same, but for median absolute deviation
mad(rep(x, y))
})
stat_labels = c("mean", "median", "variance", "median abs. dev.")
plot_stats = data.frame(plot_u, plot_med, plot_var, plot_mad)
plot(0, 0, xlim = c(0, times), ylim = c(0, max(plot_stats) + 1), pch = NA, xlab = "N repeats", ylab = "value", main = paste0(title, " Repeated Statistics"))
for(i in 1:ncol(plot_stats)){
points(plot_stats[, i], type = "o", pch = 21, lty = i, bg = i, cex = 0.7)
text(times*0.05+(i*(0.2*times)), min(plot_stats[, i])+(0.04*max(plot_stats)), labels = stat_labels[i])
}
}
stat_lines(v_range, title = "V-Shaped Range")
This function makes plotting much easier, allowing us to make some quick comparisons between our two small datasets, which we inflate using the "rep()" function.
par(mfrow = c(1,2))
stat_lines(v_range, title = "V-Shaped Range", times = 100)
stat_lines(norm_range, title = "Normal Range", times = 100)
par(mfrow = c(1,1))
We can apply this same approach to some of the fake core data.
par(mfrow = c(2,2))
stat_lines(fake_core$Cyperaceae.undiff., title = "Cyperaceae")
stat_lines(fake_core$Typha, title = "Typha")
stat_lines(fake_core$Alchornea, title = "Alchornea")
stat_lines(fake_core$Guibourtia.demeusei, title = "Guibourtia")
par(mfrow = c(1,1))
What all of this demonstrates is that variance sits on a different scale from the other statistics: because it averages squared deviations, it is expressed in the squared units of the data. Thus, to standardize our estimate of deviation and return to the original units, we take the square root of the variance, which is the standard deviation.
sd(fake_core$Cyperaceae.undiff.)
## [1] 10.25495
sd_cyp = sqrt(var(fake_core$Cyperaceae.undiff.))
sd_cyp
## [1] 10.25495
Let’s see how it performs when we systematically manipulate the data again.
par(mfrow = c(1, 2))
my_df = sapply(1:10, function(x){ #Create columns of 100 equally spaced numbers whose spacing grows with x.
seq(x, x*100, x)
})
#apply(my_df, 2, sd)
plot(apply(my_df, 2, sd), type = "o", pch = 21, bg = "orange")
lines(apply(my_df, 2, mad))
my_df = sapply(1:1000, function(x){
seq(x, x*100, x)
})
plot(apply(my_df, 2, sd), type = "o", pch = 21, bg = "orange")
lines(apply(my_df, 2, mad))
par(mfrow = c(1, 1))
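The straight lines in these panels are no accident. Each column of `my_df` is `seq(x, x*100, x)`, which is just `x * (1:100)`, and both statistics scale linearly with the data: sd(k * x) = k * sd(x), and likewise for `mad()`. A minimal sketch:

```r
x <- 1:100
k <- 7                             #an arbitrary scaling factor
all.equal(sd(k * x), k * sd(x))    #TRUE: sd scales linearly
all.equal(mad(k * x), k * mad(x))  #TRUE: so does mad
```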
Let’s add standard deviations to our basic plots and see where they fall.
stat_lines = function(x, #Create a function called "stat_lines"; it expects a vector of data "x" and the following arguments
times = length(x), #Defaults to "length(x)", so supplying this argument won't always be necessary. But we can add our own numbers!
title = ""){ #A title prefix, defaulting to an empty string so an untitled plot doesn't print "NA"
plot_u = sapply(1:times, function(y){ #Here's that fast custom function that creates means for the repeated set of numbers.
mean(rep(x, y))
})
plot_med = sapply(1:times, function(y){ #Same, but for the median.
median(rep(x, y))
})
plot_var = sapply(1:times, function(y){ #Same, but for variance.
var(rep(x, y))
})
plot_mad = sapply(1:times, function(y){ #Same, but for median absolute deviation
mad(rep(x, y))
})
plot_sd = sapply(1:times, function(y){
sd(rep(x, y))
})
stat_labels = c("mean", "median", "variance", "median abs. dev.", "SD") #Because we automate everything below, all we have to do is modify the names.
plot_stats = data.frame(plot_u, plot_med, plot_var, plot_mad, plot_sd) #We're using the plot_stats dataframe below, so we modify the entry here.
plot(0, 0, xlim = c(0, times), ylim = c(0, max(plot_stats) + 1), pch = NA, xlab = "N repeats", ylab = "value", main = paste0(title, " Repeated Statistics"))
for(i in 1:ncol(plot_stats)){
points(plot_stats[, i], type = "o", pch = 21, lty = i, bg = i, cex = 0.7)
text(times*0.05+(i*(0.2*times)), min(plot_stats[, i])+(0.04*max(plot_stats)), labels = stat_labels[i])
}
}
Let’s plot our selected taxa again.
par(mfrow = c(2,2))
stat_lines(fake_core$Cyperaceae.undiff., title = "Cyperaceae")
stat_lines(fake_core$Typha, title = "Typha")
stat_lines(fake_core$Alchornea, title = "Alchornea")
stat_lines(fake_core$Guibourtia.demeusei, title = "Guibourtia")
par(mfrow = c(1,1))
These are useful plots for showing us the basic contours of our data.
par(mar = c(5, 6, 4, 2) + 0.1)
boxplot(fake_core,
horizontal = TRUE,
las = 2,
cex.axis = 0.5)
par(mar = c(5, 4, 4, 2) + 0.1)
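As a reading aid: by default, `boxplot()` draws the box from the first to the third quartile with a bar at the median, extends the whiskers to the most extreme points within 1.5 times the interquartile range of the box (the `range` argument), and plots anything beyond that as an outlier point. A small sketch with made-up data (not the core data):

```r
set.seed(1)                   #reproducible toy data
x <- c(rnorm(99), 10)         #99 standard normals plus one clear outlier
b <- boxplot(x, plot = FALSE) #compute the statistics without drawing
b$stats                       #whisker ends, hinges, and median
b$out                         #points flagged as outliers (includes the 10)
```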
There are some more useful ways to visualize our data, however. We can use kernel density plots to look at overall distributions of data in a more detailed way. This also helps us get a sense of how normally distributed the data actually are.
dens_plot = lapply(fake_core, function(x){
density(x)
})
par(mar = c(5, 6, 4, 2) + 0.1)
boxplot(fake_core,
horizontal = TRUE,
las = 2,
cex.axis = 0.5,
main = "Boxplots and Kernel Densities")
for(i in 1:length(dens_plot)){
ylines = (dens_plot[[i]][['y']]/max(dens_plot[[i]][['y']]))*0.5
lines(dens_plot[[i]][['x']], ylines + i, lty = 3, col = "blue")
}
par(mar = c(5, 4, 4, 2) + 0.1)
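The shape of each kernel density depends on a smoothing bandwidth, which `density()` picks automatically (Silverman's rule of thumb, `bw.nrd0`, by default). The `adjust` argument scales that choice: values above 1 smooth the curve toward fewer modes, values below 1 reveal more structure. A quick illustration on a built-in dataset (not the core data):

```r
x <- faithful$eruptions              #built-in Old Faithful data, bimodal
d <- density(x)                      #default bandwidth (bw.nrd0)
d$bw                                 #the bandwidth that was chosen
d_smooth <- density(x, adjust = 2)   #double the bandwidth: smoother curve
d_rough  <- density(x, adjust = 0.5) #half the bandwidth: more structure
```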
You could also use the vioplot() method, which uses these densities to create boxplots whose margins are polygons shaped like the densities.
par(mar = c(5, 6, 4, 2) + 0.1)
vioplot::vioplot(fake_core,
horizontal = TRUE,
las = 2,
cex.axis = 0.5,
col = "gold")
for(i in 1:length(dens_plot)){
ylines = (dens_plot[[i]][['y']]/max(dens_plot[[i]][['y']]))*0.5
lines(dens_plot[[i]][['x']], ylines + i, lty = 3, col = "blue")
}
par(mar = c(5, 4, 4, 2) + 0.1)